EE 377/STATS 311 Project Report: Learning from Different Domains

Authors

  • Farzan Farnia
  • Nishal Shah
  • Milind Rao
Abstract

Most machine learning algorithms are designed in theory and applied in practice under the assumption that the training data and test data arise from a common distribution. In several natural language processing applications, ranging from information extraction to sentiment analysis, classification algorithms are trained on a limited set of documents and then applied to works spanning different genres. Similarly, in genetics, most experiments are conducted and predictive methods developed on people from a particular geography, yet the results are applied broadly to all humans, whose genetic makeup arises from a different distribution. It is therefore of interest to learn how well algorithms trained on samples from a source distribution perform on test data arising from a target distribution. In [1], the authors investigate this problem of domain adaptation. They first consider binary classification trained on labeled data from a source distribution and present classification error bounds in terms of the source domain error and divergence measures between the two distributions. Unlabeled data from either domain is then leveraged to estimate the divergence between domains, which yields a second bound. The third question the authors address is learning with different amounts of labeled target and source data, which they handle via a hypothesis that minimizes a convex combination of the errors on the two domains.

We aim to extend the results of the paper in a few directions. In Section 2, we introduce the problem setup. We then present a simple upper bound on the generalization error with different domains. The paper provides this result only for binary functions with 0-1 loss; we extend it to arbitrary functions and a large class of loss functions. Seeking to make this bound tighter, we define a classifier-based distance for a larger class of hypotheses and loss functions. We then extend the results by giving guarantees on an algorithm that seeks to minimize a convex combination of training error and test error, and we attempt to use this bound to show what amount of regularization is appropriate. In Section 3, we present an alternate view of the correct amount of regularization. Finally, we validate some of the theoretical results with simulations in Section 4. Conclusions are presented in Section 5.
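
As a minimal sketch of the convex-combination objective described in the abstract, the hypothesis can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the abstract: $\mathcal{H}$ is the hypothesis class, $\hat{\epsilon}_S(h)$ and $\hat{\epsilon}_T(h)$ are the empirical errors of $h$ on the labeled source and target samples, and $\alpha$ is the mixing weight.

```latex
% Hypothetical notation: \alpha, \hat{\epsilon}_S, \hat{\epsilon}_T, \mathcal{H}
% are illustrative symbols, not taken from the report itself.
\hat{h} \in \arg\min_{h \in \mathcal{H}}
  \; \alpha\, \hat{\epsilon}_T(h) + (1 - \alpha)\, \hat{\epsilon}_S(h),
  \qquad \alpha \in [0, 1]
```

Under this reading, $\alpha = 0$ trains on source data alone and $\alpha = 1$ on target data alone; intermediate values trade the larger source sample against the more relevant target sample.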




Publication date: 2016